The data has detailed attributes for every player registered in the latest edition of FIFA 19 database, obtained scraping the website “sofifa.com”. Each instance is a different player, and the attributes give basic information about the players and their football skills. Basic pre-processing was done and Goal Keepers were removed for this assignment.
Please look here for the original data overview and attributes’ descriptions:
And here to get a better view of the information:
[Task 1]: Load the dataset, completing the code below (keep the dataframe name as fifa)
# Loading
fifa <- read.csv("/Users/felix/Dropbox/CMPT459/Programming_Assignments/Assignment1/fifa.csv")
[Checkpoint 1]: How many rows and columns exist?
cat(ifelse(all(dim(fifa) == c(16122, 68)), "Correct results!", "Wrong results.."))
## Correct results!
[Task 2]: Give a very brief overview of the types of each attribute and their values. HINT: Functions str, table, summary.
# Overview
# To show types of each attribute
str(fifa)
## 'data.frame': 16122 obs. of 68 variables:
## $ ID : int 158023 20801 190871 192985 183277 177003 176580 155862 188545 182521 ...
## $ Age : int 31 33 26 27 27 32 31 32 29 28 ...
## $ Overall : int 94 94 92 91 91 91 91 91 90 90 ...
## $ Potential : int 94 94 93 92 91 91 91 91 90 90 ...
## $ Value : Factor w/ 211 levels "€0","€1.1M","€1.2M",..: 17 190 19 13 208 179 196 153 190 188 ...
## $ Wage : Factor w/ 144 levels "€0","€100K","€105K",..: 95 75 56 67 65 78 82 71 34 67 ...
## $ Height : Factor w/ 20 levels "5'1","5'10","5'11",..: 9 14 11 3 10 10 12 12 12 12 ...
## $ Weight : Factor w/ 53 levels "110lbs","115lbs",..: 22 33 18 20 24 16 36 32 30 26 ...
## $ LS : int 90 94 87 85 86 80 92 76 90 81 ...
## $ ST : int 90 94 87 85 86 80 92 76 90 81 ...
## $ RS : int 90 94 87 85 86 80 92 76 90 81 ...
## $ LW : int 94 92 92 90 92 88 91 73 86 84 ...
## $ LF : int 95 93 92 90 91 87 92 74 89 85 ...
## $ CF : int 95 93 92 90 91 87 92 74 89 85 ...
## $ RF : int 95 93 92 90 91 87 92 74 89 85 ...
## $ RW : int 94 92 92 90 92 88 91 73 86 84 ...
## $ LAM : int 95 91 92 91 92 90 90 74 86 87 ...
## $ CAM : int 95 91 92 91 92 90 90 74 86 87 ...
## $ RAM : int 95 91 92 91 92 90 90 74 86 87 ...
## $ LM : int 93 91 91 91 92 89 89 75 84 85 ...
## $ LCM : int 86 84 84 90 85 91 84 78 80 89 ...
## $ CM : int 86 84 84 90 85 91 84 78 80 89 ...
## $ RCM : int 86 84 84 90 85 91 84 78 80 89 ...
## $ RM : int 93 91 91 91 92 89 89 75 84 85 ...
## $ LWB : int 66 68 68 80 69 85 74 84 64 82 ...
## $ LDM : int 63 64 63 80 66 84 73 87 65 85 ...
## $ CDM : int 63 64 63 80 66 84 73 87 65 85 ...
## $ RDM : int 63 64 63 80 66 84 73 87 65 85 ...
## $ RWB : int 66 68 68 80 69 85 74 84 64 82 ...
## $ LB : int 61 64 63 76 63 82 71 87 61 80 ...
## $ LCB : int 49 56 50 69 52 74 68 90 60 75 ...
## $ CB : int 49 56 50 69 52 74 68 90 60 75 ...
## $ RCB : int 49 56 50 69 52 74 68 90 60 75 ...
## $ RB : int 61 64 63 76 63 82 71 87 61 80 ...
## $ Crossing : int 84 84 79 93 81 86 77 66 62 88 ...
## $ Finishing : int 95 94 87 82 84 72 93 60 91 76 ...
## $ HeadingAccuracy : int 70 89 62 55 61 55 77 91 85 54 ...
## $ ShortPassing : int 90 81 84 92 89 93 82 78 83 92 ...
## $ Volleys : int 86 87 84 82 80 76 88 66 89 82 ...
## $ Dribbling : int 97 88 96 86 95 90 87 63 85 81 ...
## $ Curve : int 93 81 88 85 83 85 86 74 77 86 ...
## $ FKAccuracy : int 94 76 87 83 79 78 84 72 86 84 ...
## $ LongPassing : int 87 77 78 91 83 88 64 77 65 93 ...
## $ BallControl : int 96 94 95 91 94 93 90 84 89 90 ...
## $ Acceleration : int 91 89 94 78 94 80 86 76 77 64 ...
## $ SprintSpeed : int 86 91 90 76 88 72 75 75 78 62 ...
## $ Agility : int 91 87 96 79 95 93 82 78 78 70 ...
## $ Reactions : int 95 96 94 91 90 90 92 85 90 89 ...
## $ Balance : int 95 70 84 77 94 94 83 66 78 71 ...
## $ ShotPower : int 85 95 80 91 82 79 86 79 88 87 ...
## $ Jumping : int 68 95 61 63 56 68 69 93 84 30 ...
## $ Stamina : int 72 88 81 90 83 89 90 84 78 75 ...
## $ Strength : int 59 79 49 75 66 58 83 83 84 73 ...
## $ LongShots : int 94 93 82 91 80 82 85 59 84 92 ...
## $ Aggression : int 48 63 56 76 54 62 87 88 80 60 ...
## $ Interceptions : int 22 29 36 61 41 83 41 90 39 82 ...
## $ Positioning : int 94 95 89 87 87 79 92 60 91 79 ...
## $ Vision : int 94 82 87 94 89 92 84 63 77 86 ...
## $ Penalties : int 75 85 81 79 86 82 85 75 88 73 ...
## $ Composure : int 96 95 94 88 91 84 85 82 86 85 ...
## $ Marking : int 33 28 27 68 34 60 62 87 34 72 ...
## $ StandingTackle : int 28 31 24 58 27 76 45 92 42 79 ...
## $ SlidingTackle : int 26 23 33 51 22 73 38 91 19 69 ...
## $ Release.Clause : Factor w/ 1200 levels "","€1.1M","€1.2M",..: 284 80 285 224 181 100 163 24 80 143 ...
## $ Preferred.Foot : Factor w/ 2 levels "Left","Right": 1 2 2 2 2 2 2 2 2 2 ...
## $ Work.Rate : Factor w/ 9 levels "HighHigh","HighLow",..: 9 2 3 1 3 1 3 3 3 9 ...
## $ Position : Factor w/ 26 levels "CAM","CB","CDM",..: 21 26 14 19 11 19 23 18 26 9 ...
## $ International.Reputation: int 5 5 5 4 4 4 5 4 4 4 ...
# Overview of distribution of values for each attribute
summary(fifa)
## ID Age Overall Potential
## Min. : 16 Min. :16.00 Min. :46.00 Min. :48.00
## 1st Qu.:200917 1st Qu.:21.00 1st Qu.:62.00 1st Qu.:67.00
## Median :222028 Median :25.00 Median :66.00 Median :71.00
## Mean :215084 Mean :25.01 Mean :66.46 Mean :71.52
## 3rd Qu.:236638 3rd Qu.:28.00 3rd Qu.:71.00 3rd Qu.:75.00
## Max. :246620 Max. :41.00 Max. :94.00 Max. :95.00
##
## Value Wage Height Weight
## €1.1M : 407 €1K :4032 6'0 :2609 154lbs :1398
## €375K : 314 €2K :2503 5'10 :2444 165lbs :1368
## €325K : 308 €3K :1699 5'9 :2226 159lbs : 915
## €1.2M : 306 €4K :1156 5'11 :2054 161lbs : 896
## €425K : 301 €5K : 786 6'1 :1588 172lbs : 866
## €1M : 298 €6K : 622 6'2 :1531 163lbs : 857
## (Other):14188 (Other):5324 (Other):3670 (Other):9822
## LS ST RS LW
## Min. :33.00 Min. :33.00 Min. :33.00 Min. :27.00
## 1st Qu.:54.00 1st Qu.:54.00 1st Qu.:54.00 1st Qu.:55.00
## Median :60.00 Median :60.00 Median :60.00 Median :62.00
## Mean :59.84 Mean :59.84 Mean :59.84 Mean :61.06
## 3rd Qu.:66.00 3rd Qu.:66.00 3rd Qu.:66.00 3rd Qu.:68.00
## Max. :94.00 Max. :94.00 Max. :94.00 Max. :94.00
##
## LF CF RF RW
## Min. :29.00 Min. :29.00 Min. :29.00 Min. :27.00
## 1st Qu.:55.00 1st Qu.:55.00 1st Qu.:55.00 1st Qu.:55.00
## Median :62.00 Median :62.00 Median :62.00 Median :62.00
## Mean :60.74 Mean :60.74 Mean :60.74 Mean :61.06
## 3rd Qu.:68.00 3rd Qu.:68.00 3rd Qu.:68.00 3rd Qu.:68.00
## Max. :95.00 Max. :95.00 Max. :95.00 Max. :94.00
##
## LAM CAM RAM LM LCM
## Min. :29 Min. :29 Min. :29 Min. :29.00 Min. :32.00
## 1st Qu.:55 1st Qu.:55 1st Qu.:55 1st Qu.:56.00 1st Qu.:54.00
## Median :62 Median :62 Median :62 Median :63.00 Median :61.00
## Mean :61 Mean :61 Mean :61 Mean :61.76 Mean :60.24
## 3rd Qu.:68 3rd Qu.:68 3rd Qu.:68 3rd Qu.:68.00 3rd Qu.:66.00
## Max. :95 Max. :95 Max. :95 Max. :93.00 Max. :91.00
##
## CM RCM RM LWB
## Min. :32.00 Min. :32.00 Min. :29.00 Min. :32.00
## 1st Qu.:54.00 1st Qu.:54.00 1st Qu.:56.00 1st Qu.:53.00
## Median :61.00 Median :61.00 Median :63.00 Median :60.00
## Mean :60.24 Mean :60.24 Mean :61.76 Mean :59.53
## 3rd Qu.:66.00 3rd Qu.:66.00 3rd Qu.:68.00 3rd Qu.:66.00
## Max. :91.00 Max. :91.00 Max. :93.00 Max. :88.00
##
## LDM CDM RDM RWB
## Min. :30.00 Min. :30.00 Min. :30.00 Min. :32.00
## 1st Qu.:51.00 1st Qu.:51.00 1st Qu.:51.00 1st Qu.:53.00
## Median :60.00 Median :60.00 Median :60.00 Median :60.00
## Mean :58.85 Mean :58.85 Mean :58.85 Mean :59.53
## 3rd Qu.:66.00 3rd Qu.:66.00 3rd Qu.:66.00 3rd Qu.:66.00
## Max. :90.00 Max. :90.00 Max. :90.00 Max. :88.00
##
## LB LCB CB RCB
## Min. :31.00 Min. :27.00 Min. :27.00 Min. :27.00
## 1st Qu.:52.00 1st Qu.:48.00 1st Qu.:48.00 1st Qu.:48.00
## Median :60.00 Median :59.00 Median :59.00 Median :59.00
## Mean :58.83 Mean :57.64 Mean :57.64 Mean :57.64
## 3rd Qu.:66.00 3rd Qu.:67.00 3rd Qu.:67.00 3rd Qu.:67.00
## Max. :87.00 Max. :90.00 Max. :90.00 Max. :90.00
##
## RB Crossing Finishing HeadingAccuracy
## Min. :31.00 Min. :11.0 Min. :10.00 Min. :15.00
## 1st Qu.:52.00 1st Qu.:44.0 1st Qu.:36.00 1st Qu.:49.00
## Median :60.00 Median :56.0 Median :52.00 Median :58.00
## Mean :58.83 Mean :54.2 Mean :49.76 Mean :57.07
## 3rd Qu.:66.00 3rd Qu.:65.0 3rd Qu.:63.00 3rd Qu.:65.00
## Max. :87.00 Max. :93.0 Max. :95.00 Max. :94.00
##
## ShortPassing Volleys Dribbling Curve
## Min. :20.00 Min. :10.00 Min. :14.00 Min. :11.00
## 1st Qu.:57.00 1st Qu.:35.00 1st Qu.:55.00 1st Qu.:39.00
## Median :64.00 Median :47.00 Median :63.00 Median :52.00
## Mean :62.64 Mean :46.75 Mean :60.59 Mean :51.25
## 3rd Qu.:69.00 3rd Qu.:58.00 3rd Qu.:69.00 3rd Qu.:63.00
## Max. :93.00 Max. :90.00 Max. :97.00 Max. :94.00
##
## FKAccuracy LongPassing BallControl Acceleration
## Min. :10.00 Min. :19.0 Min. :25.0 Min. :20.00
## 1st Qu.:34.00 1st Qu.:49.0 1st Qu.:58.0 1st Qu.:62.00
## Median :44.00 Median :58.0 Median :64.0 Median :69.00
## Mean :46.43 Mean :56.1 Mean :63.2 Mean :67.92
## 3rd Qu.:58.00 3rd Qu.:65.0 3rd Qu.:70.0 3rd Qu.:76.00
## Max. :94.00 Max. :93.0 Max. :96.0 Max. :97.00
##
## SprintSpeed Agility Reactions Balance
## Min. :25.00 Min. :23.00 Min. :21.0 Min. :22.00
## 1st Qu.:62.00 1st Qu.:59.00 1st Qu.:56.0 1st Qu.:60.00
## Median :69.00 Median :68.00 Median :62.0 Median :68.00
## Mean :67.99 Mean :66.43 Mean :62.2 Mean :66.59
## 3rd Qu.:76.00 3rd Qu.:75.00 3rd Qu.:68.0 3rd Qu.:75.00
## Max. :96.00 Max. :96.00 Max. :96.0 Max. :96.00
##
## ShotPower Jumping Stamina Strength
## Min. :14.0 Min. :28.00 Min. :27.00 Min. :25.00
## 1st Qu.:51.0 1st Qu.:59.00 1st Qu.:61.00 1st Qu.:59.00
## Median :61.0 Median :67.00 Median :68.00 Median :67.00
## Mean :59.6 Mean :66.02 Mean :67.34 Mean :65.91
## 3rd Qu.:69.0 3rd Qu.:74.00 3rd Qu.:75.00 3rd Qu.:75.00
## Max. :95.0 Max. :95.00 Max. :96.00 Max. :97.00
##
## LongShots Aggression Interceptions Positioning
## Min. :11.00 Min. :13.00 Min. :10.00 Min. :11.00
## 1st Qu.:40.00 1st Qu.:50.00 1st Qu.:34.00 1st Qu.:46.00
## Median :54.00 Median :61.00 Median :56.00 Median :57.00
## Mean :51.44 Mean :59.58 Mean :50.43 Mean :54.82
## 3rd Qu.:64.00 3rd Qu.:70.00 3rd Qu.:65.00 3rd Qu.:65.00
## Max. :94.00 Max. :95.00 Max. :92.00 Max. :95.00
##
## Vision Penalties Composure Marking
## Min. :12.00 Min. :12.00 Min. :22.00 Min. :10.00
## 1st Qu.:47.00 1st Qu.:42.00 1st Qu.:53.00 1st Qu.:37.00
## Median :57.00 Median :52.00 Median :61.00 Median :56.00
## Mean :55.47 Mean :52.13 Mean :60.35 Mean :51.31
## 3rd Qu.:65.00 3rd Qu.:62.00 3rd Qu.:68.00 3rd Qu.:65.00
## Max. :94.00 Max. :92.00 Max. :96.00 Max. :94.00
##
## StandingTackle SlidingTackle Release.Clause Preferred.Foot
## Min. :10.00 Min. :10.00 : 1379 Left : 4003
## 1st Qu.:35.00 1st Qu.:32.00 €1.1M : 504 Right:12119
## Median :59.00 Median :56.00 €1.3M : 384
## Mean :51.92 Mean :49.63 €1.2M : 359
## 3rd Qu.:67.00 3rd Qu.:65.00 €1.4M : 357
## Max. :93.00 Max. :91.00 €1.5M : 314
## (Other):12825
## Work.Rate Position International.Reputation
## MediumMedium:7779 ST :2152 Min. :1.000
## HighMedium :3170 CB :1778 1st Qu.:1.000
## MediumHigh :1690 CM :1394 Median :1.000
## HighHigh :1015 LB :1322 Mean :1.115
## MediumLow : 849 RB :1291 3rd Qu.:1.000
## HighLow : 697 RM :1124 Max. :5.000
## (Other) : 922 (Other):7061
[Checkpoint 2]: Were functions used to display data types and give some idea of the information of the attributes?
Functions suggested to use on this part: ifelse, substr, nchar, str_split, map_dbl.
Five attributes need to be cleaned.
[Task 3]: The first 3 of the 5 attributes listed above that need to be cleaned are very alike. Create only one function to clean them the same way. This function should get the vector of attribute values as parameter and return it cleaned, so use it three times, each with one of the columns. Encode zeroes or blank as NA.
# Function used to clean attributes
attr_fix <- function(attribute){
cleaned_attribute <- as.character(attribute)
# Modify euro symbol character
ifelse(grepl("\u20ac",cleaned_attribute),
cleaned_attribute <- gsub('\u20ac',"",cleaned_attribute),
cleaned_attribute <- cleaned_attribute)
# Modify M character
for (i in grep("M",cleaned_attribute)){
cleaned_attribute[i] <- gsub("M","",cleaned_attribute[i])
cleaned_attribute[i] <- as.numeric(cleaned_attribute[i]) * 10^6
}
# Modify K character
for (i in grep("K",cleaned_attribute)){
cleaned_attribute[i] <- gsub("K","",cleaned_attribute[i])
cleaned_attribute[i] <- as.numeric(cleaned_attribute[i]) * 10^3
}
#encode zeroes or blank as NA
cleaned_attribute[cleaned_attribute==""] <- NA
cleaned_attribute[cleaned_attribute=="0"] <- NA
#make attribute numeric
cleaned_attribute <- as.numeric(cleaned_attribute)
return(cleaned_attribute)
}
# Cleaning attributes
fifa$Value <- attr_fix(fifa$Value)
fifa$Wage <- attr_fix(fifa$Wage)
fifa$Release.Clause <- attr_fix(fifa$Release.Clause)
[Checkpoint 3]: How many NA values?
cat(ifelse(sum(is.na(fifa))==1779, "Correct results!", "Wrong results.."))
## Correct results!
[Task 4]: Clean the other two attributes. Hint: To convert to “cm” use http://www.sengpielaudio.com/calculator-bodylength.htm.
# Cleaning attribute Weight:
weight_fix <- function(attribute){
cleaned_attribute <- as.character(attribute)
#remove "lbs"
for (i in grep("lbs",cleaned_attribute)){
cleaned_attribute[i] <- gsub("lbs","",cleaned_attribute[i])
}
cleaned_attribute <- as.numeric((cleaned_attribute))
return(cleaned_attribute)
}
fifa$Weight <- weight_fix(fifa$Weight)
# Cleaning attribute Height:
height_fix <- function(attribute){
cleaned_attribute = as.character(attribute)
# for values with foot only
# multiply feet by 30.48 to get cm
for (i in !grep("\'",cleaned_attribute)){
ifelse(cleaned_attribute[i]=="",
cleaned_attribute[i] <- NA,
as.numeric(cleaned_attribute[i])*30.48)
}
#for values with foot and inches
for (i in grep("\'",cleaned_attribute)){
feet <- as.numeric(unlist(strsplit(cleaned_attribute[i],"\'"))[1])
inch <- as.numeric(unlist(strsplit(cleaned_attribute[i],"\'"))[2])
# 1 foot is 30.48 cm
# 1 inch is 2.54 cm
cleaned_attribute[i] <- feet*30.48 + inch*2.54
}
cleaned_attribute <- as.numeric(cleaned_attribute)
return(cleaned_attribute)
}
fifa$Height <- height_fix(fifa$Height)
[Checkpoint 4]: What are the mean values of these two columns?
cat(ifelse(all(c(round(mean(fifa[,8]),4)==164.1339, round(mean(fifa[,7]),4)==180.3887)), "Correct results!", "Wrong results.."))
## Correct results!
[Task 5]: What columns have missing values? List them below (Replace
Columns with missing values:
#Get columns with missing values
columns_with_na <- colnames(fifa)[apply(fifa,2,anyNA)]
library(mice)
## Loading required package: lattice
##
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
##
## cbind, rbind
# Handling NA values
exclude <- c('ID','International.Reputation')
include <- setdiff(names(fifa),exclude)
to_impute <- fifa[include]
#MICE imputation based on random forests
imp_fifa <- mice (to_impute,m = 5,method ='rf',seed = 1)
##
## iter imp variable
## 1 1 Value Wage Release.Clause
## 1 2 Value Wage Release.Clause
## 1 3 Value Wage Release.Clause
## 1 4 Value Wage Release.Clause
## 1 5 Value Wage Release.Clause
## 2 1 Value Wage Release.Clause
## 2 2 Value Wage Release.Clause
## 2 3 Value Wage Release.Clause
## 2 4 Value Wage Release.Clause
## 2 5 Value Wage Release.Clause
## 3 1 Value Wage Release.Clause
## 3 2 Value Wage Release.Clause
## 3 3 Value Wage Release.Clause
## 3 4 Value Wage Release.Clause
## 3 5 Value Wage Release.Clause
## 4 1 Value Wage Release.Clause
## 4 2 Value Wage Release.Clause
## 4 3 Value Wage Release.Clause
## 4 4 Value Wage Release.Clause
## 4 5 Value Wage Release.Clause
## 5 1 Value Wage Release.Clause
## 5 2 Value Wage Release.Clause
## 5 3 Value Wage Release.Clause
## 5 4 Value Wage Release.Clause
## 5 5 Value Wage Release.Clause
## Warning: Number of logged events: 91
complete_fifa <- complete(imp_fifa,1)
# Putting columns not used on imputation back into "fifa" dataframe
fifa["Value"] <- complete_fifa["Value"]
fifa["Wage"] <- complete_fifa["Wage"]
fifa["Release.Clause"] <- complete_fifa["Release.Clause"]
#fifa <- cbind(complete_fifa),fifa[exclude])
[Checkpoint 5]: How many instances have at least one NA? It should be 0 now. How many columns are there? It should be 68 (remember to put back “ID” and “International.Reputation”).
cat(ifelse(all(sum(is.na(fifa))==0, ncol(fifa)==68), "Correct results!", "Wrong results.."))
## Correct results!
[Task 6]: Create a new attribute called “Position.Rating” that has the rating value of the position corresponding to the player. For example, if the player has the value “CF” on the attribute “Position”, then “Position.Rating” should have the number on the “CF” attribute. After that, remove the “Position” attribute from the data.
# Creating the attribute "Position.Rating"
# Iterating through all the rows
for (i in 1:nrow(fifa)){
#get the rating value of the position
position_name <- toString(fifa[i,67])
#get the value of the position name
val_position <- fifa[i,position_name]
#Enter the value as the Position.Rating for that row
fifa$Position.Rating[i] <- val_position
}
# Removing the attribute "Position"
#exclude the "Position" Column
exclude <- c('Position')
include <- setdiff(names(fifa),exclude)
#Include all columns except "Position"
fifa <- fifa[include]
[Checkpoint 6]: What’s the mean of the “Position.Rating” attribute created? How many columns are there in the dataframe? It should be 68 (remember to remove “Position”).
cat(ifelse(all(c(round(mean(fifa$Position.Rating),5) == 66.87067, ncol(fifa)==68)), "Correct results!", "Wrong results.."))
## Correct results!
[Task 7]: Performe PCA (Principal Component Analysis) on the columns representing ratings of positions (that is, attributes: LS, ST, RS, LW, LF, CF, RF, RW, LAM, CAM, RAM, LM, LCM, CM, RCM, RM, LWB, LDM, CDM, RDM, RWB, LB, LCB, CB, RCB, RB). Show the summary of the components obtained. Keep the minimum number of components to have at least 98.50% of the variance explained by them.. Remove the columns used for PCA. HINT: Function prcomp, remember to center and scale.
# Perform PCA
#First rating of position is LS
#Last rating of position is RB
fifa.pca <- prcomp(fifa[grep("LS",colnames(fifa)):
grep("RB",colnames(fifa))],
center = TRUE,
scale. = TRUE)
# Show Summary
summary(fifa.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 4.0535 2.9757 0.65474 0.48078 0.15613 0.11051
## Proportion of Variance 0.6319 0.3406 0.01649 0.00889 0.00094 0.00047
## Cumulative Proportion 0.6319 0.9725 0.98900 0.99789 0.99883 0.99930
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.08463 0.06869 0.06073 0.05198 9.543e-14 3.8e-15
## Proportion of Variance 0.00028 0.00018 0.00014 0.00010 0.000e+00 0.0e+00
## Cumulative Proportion 0.99957 0.99975 0.99990 1.00000 1.000e+00 1.0e+00
## PC13 PC14 PC15 PC16 PC17
## Standard deviation 1.311e-15 7.103e-16 5.991e-16 4.7e-16 3.668e-16
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.0e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00 1.0e+00 1.000e+00
## PC18 PC19 PC20 PC21 PC22
## Standard deviation 3.577e-16 3.577e-16 3.577e-16 3.577e-16 3.577e-16
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00 1.000e+00 1.000e+00
## PC23 PC24 PC25 PC26
## Standard deviation 3.577e-16 3.577e-16 2.135e-16 5.766e-17
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00 1.000e+00
#First 3 Components have at least 98.50% of the variance explained by them
# Put the components back into "fifa" dataframe
fifa <- cbind(fifa,fifa.pca$x)
#Only keep the first 3 principle components PC1 PC2 PC3
fifa <- fifa[,1:71]
# Remove original columns used for PCA
exclude <- colnames(fifa[9:34])
include <- setdiff(names(fifa),exclude)
fifa <- fifa[include]
[Checkpoint 7]: How many columns exist in the dataset? It should be 45.
cat(ifelse(ncol(fifa)==45, "Correct results!", "Wrong results.."))
## Correct results!
[Bonus]: Use the code below to see which columns influenced the most each component graphically. Replace “fifa.pca” with the object result from the use of prcomp function.
library(factoextra)
## Loading required package: ggplot2
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
fviz_pca_var(fifa.pca,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
[Task 8]: Perform binarization on the following categorical attributes: “Preferred.Foot” and “Work.Rate”. HINT: R package “dummies”, function dummy.data.frame.
# Binarize categorical attributes
library(dummies)
## dummies-1.5.6 provided by Decision Patterns
fifa <- dummy.data.frame(fifa,c("Preferred.Foot","Work.Rate"))
[Checkpoint 8]: How many columns exist in the dataset? It should be 54.
cat(ifelse(ncol(fifa)==54, "Correct results!", "Wrong results.."))
## Correct results!
[Task 9]: Remove attribute “ID” from “fifa” dataframe, save attribute “International.Reputation” on vector named “IntRep” and then also remove “International.Reputation” from “fifa” dataframe. Perform z-score normalization on “fifa”, except for columns that came from PCA. Finally combine the normalized attributes with those from PCA saving on “fifa” dataframe. HINT: Function scale.
#Remove attribute "ID" from "fifa" dataframe
exclude <- c('ID')
include <- setdiff(names(fifa),exclude)
fifa <- fifa[include]
#save attribute "International.Reputation"" on vector named IntRep
IntRep <- fifa$International.Reputation
#Also remove "International.Reputation" from fifa dataframe
exclude <- c('International.Reputation')
include <- setdiff(names(fifa),exclude)
fifa <- fifa[include]
# Normalize with Z-Score
fifa_normalized <- scale(fifa[1:49],center = TRUE,scale=TRUE)
#Combine normalized attributes with those from PCA
#saving on "fifa" dataframe
fifa <- cbind(fifa_normalized,fifa[50:52])
[Checkpoint 9]: How many columns exist in the dataset? It should be 52. What’s the mean of all the means of the attributes? Should be around zero.
cat(ifelse(ncol(fifa)==52, "Correct results!", "Wrong results.."))
## Correct results!
[Task 9]: Perform K-Means for values of K ranging from 2 to 15. Find the best number of clusters for K-means clustering, based on the silhouette score. Report the best number of clusters and the silhouette score for the corresponding clustering (Replace
# K-Means and Silhouette scores
library(cluster)
library(purrr)
#getting distance for dataset
dist_fifa <- dist(fifa)
set.seed(1)
#avg_sil function from https://uc-r.github.io/kmeans_clustering#silo
avg_sil <- function(k) {
km.res <- kmeans(fifa, centers = k,iter.max = 15, nstart = 25)
ss <- silhouette(km.res$cluster, dist_fifa)
mean(ss[, 3])
}
k.values <- 2:15
avg_sil_values <- map_dbl(k.values, avg_sil)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 806100)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 806100)
#based on code from https://uc-r.github.io/kmeans_clustering#silo
plot(k.values, avg_sil_values,
type = "b", pch = 19, frame = FALSE,
xlab = "Number of clusters K",
ylab = "Average Silhouettes")
#silhouette score for clusters:2(best number of clusters)
cat("silhouette score for 2clusters(best number of clusters): " ,avg_sil_values[1] ,"\n")
## silhouette score for 2clusters(best number of clusters): 0.2136819
Results found:
[Checkpoint 9]: Are there silhouette scores for K-Means with K ranging from 2 to 15? Were the best K and correspondent silhouette score reported?
[Task 10]: Perform K-means with the K chosen and get the resulting groups. Try out several pairs of attributes and produce scatter plots of the clustering from task 9 for these pairs of attributes. By inspecting these plots, determine a pair of attributes for which the clusters are relatively well-separated and submit the corresponding scatter plot.
# K-Means for best K and Plot
k_means_2cluster <- kmeans(fifa, centers = 2,iter.max = 15, nstart = 25)
summary(k_means_2cluster)
## Length Class Mode
## cluster 16122 -none- numeric
## centers 104 -none- numeric
## totss 1 -none- numeric
## withinss 2 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 2 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
plot1 <- plot(fifa[c("Aggression","Positioning")],main="Clustering for Attributes Pair: Aggression and Positioning",col = k_means_2cluster$cluster)
print("Attribute Pair Reactions and Vision have clusters relatively well-seperated")
## [1] "Attribute Pair Reactions and Vision have clusters relatively well-seperated"
plot2 <- plot(fifa[c("Reactions","Vision")],main="Clustering for Attributes Pair: Reactions and Vision",col = k_means_2cluster$cluster)
plot3 <- plot(fifa[c("LongPassing","LongShots")],main="Clustering for Attributes Pair: LongPassing and LongShots",col = k_means_2cluster$cluster)
plot4 <- plot(fifa[c("HeadingAccuracy","Penalties")],main="Clustering for Attributes Pair: HeadingAccuracy and Penalties",col = k_means_2cluster$cluster)
plot5 <- plot(fifa[c("Balance","BallControl")],main="Clustering for Attributes Pair: Balance and BallControl",col = k_means_2cluster$cluster)
[Checkpoint 10]: Is there at least one plot showing two attributes and the groups (colored or circled) reasonably separated?
[Task 11]: Sample randomly 1% of the data (set.seed(1)). Perform hierarchical cluster analysis on the dataset using the algorithms complete linkage, average linkage and single linkage. Plot the dendrograms resulting from the different methods (three methods should be applied on the same 1% sample). Discuss the commonalities and differences between the three dendrograms and try to explain the reasons leading to the differences (Replace the
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Sample and calculate distances
set.seed(1)
# Sample 1% of the dataset
fifa_sample <- sample_frac(fifa,0.01)
dist_sample <- dist(fifa_sample)
# Complete
hcluster_complete <- hclust(dist_sample,method = "complete")
plot(hcluster_complete,main="Complete Linkage Cluster Dendrogram",xlab = "Sample of Fifa Dataset")
# Average
hcluster_average <- hclust(dist_sample,method = "average")
plot(hcluster_average, main="Average Linkage Cluster Dendrogram",xlab = "Sample of Fifa Dataset")
# Single
hcluster_single <- hclust(dist_sample,method = "single")
plot(hcluster_single,main="Single Linkage Cluster Dendrogram",xlab = "Sample of Fifa Dataset")
Discussion:
Commonalities: All three algorithms are agglomerative hierachical clustering. They all start with each data point as a single cluster, then clusters join together into bigger clusters until one single cluster is formed with all the data points. The two clusters closest together are merged into one cluster at each iteration.
Differences: The method to calculate the distance between one cluster and another cluster for closeness determines the differences between the three algorithms.
Complete Linkage: The distance is defined as the farthest data point from one cluster to the farthest data point from the other cluster. So the clusters with the smallest diameters will merge together at each iteration.
Single Linkage: The distance is defined as the closest data point from one cluster to the closest data point from the other cluster.So the clusters with data points closest to datapoints of nearby clusters will merge together at each iteration.
Average Linkage: The distance is defined as an average distance of each point in one cluster to every point in the other cluster.
The dendrograms plotted illustrate the differences between the three algorithms. Single Linkage will merge the clusters with data points closest to each other so the dendrogram is more likely to form long chains.
Complete Linkage will merge clusters with big diameters later, so the dendrogram is more likely to have uniform sized trees.
Average Linkage is the middle ground between single linkage and complete linkage so the dendrogram looks in between the dendrograms for single and complete linkage.
[Checkpoint 11]: Does the discussion show commonalities and differences between the three dendrograms and explain the differences?
[Task 12]: Now perform hierarchical cluster analysis on the ENTIRE dataset using the algorithms complete linkage, average linkage and single linkage. Cut all of the three dendrograms from task 11 to obtain a flat clustering with the number of clusters determined as the best number in task 9.
To perform an external validation of the clustering results, use the vector “IntRep”" created. What is the Rand Index for the best K-means clustering? And what are the values of the Rand Index for the flat clusterings obtained in this task from complete linkage, average linkage and single linkage? Discuss the results (Replace
# Hierarchical Clusterings (Complete, Average and Single)
#Hierachical Clustering Complete Linkage
hclust_complete <- hclust(dist_fifa,method = "complete")
#Hierachical Clustering Average Linkage
hclust_average <- hclust(dist_fifa,method = "average")
#Hierachical Clustering Single Linkage
hclust_single <- hclust(dist_fifa,method = "single")
# Flat Clusterings
#Split into 2 clusters based on best cluster number for kmeans
#Complete Linkage Flat Clustering
flat_complete <- cutree(hclust_complete,2)
#Average Linkage Flat Clustering
flat_average <- cutree(hclust_average,2)
#Single Linkage Flat Clustering
flat_single <- cutree(hclust_single,2)
# Cluster Similarities
library(clusteval)
#Rand Index for Best K-means clustering (2 Clusters)
rand_kmeans <- cluster_similarity(IntRep,k_means_2cluster$cluster,similarity = "rand")
cat("Rand Index for Best K-means clustering (2 Clusters): ", rand_kmeans ,"\n")
## Rand Index for Best K-means clustering (2 Clusters): 0.4999695
#Rand Index for Complete Linkage Flat Clustering
rand_flat_complete <- cluster_similarity(IntRep,flat_complete,similarity = "rand")
cat("Rand Index for Complete Linkage Flat Clustering: " ,rand_flat_complete ,"\n")
## Rand Index for Complete Linkage Flat Clustering: 0.836581
#Rand Index for Average Linkage Flat Clustering
rand_flat_average <- cluster_similarity(IntRep,flat_average,similarity = "rand")
cat("Rand Index for Average Linkage Flat Clustering: ",rand_flat_average ,"\n")
## Rand Index for Average Linkage Flat Clustering: 0.8356267
#Rand Index for Single Linkage Flat Clustering
rand_flat_single <- cluster_similarity(IntRep,flat_single,similarity = "rand")
cat("Rand Index for Single Linkage Flat Clustering: ", rand_flat_single ,"\n")
## Rand Index for Single Linkage Flat Clustering: 0.8274063
Discussion:
Rand Index for K-means clustering is around 0.5 while the Rand Index for the Flat Clusterings are all around 0.83. The higher Rand Index values for Hierachical clusters means that there is more similarity between the Hierachical clusters and the IntRep data compared to the lower similarity between K-Means clusters and the IntRep data.This suggests that Hierarchical Clustering produces better clusters compared to K-Means Clustering for the fifa dataset.
Possible reasons that Hierachical clustering produced better clusters is that the K-Means algorithm assumes clusters are spherical, that all the attributes have around the same variance and that every cluster is around the same size. These assumptions may not have been valid for this dataset.
This can be visualized in the scatter plot for K-Means in Task 9. The clusters do not form well defined spherical clusters.
Also some of the attributes in the dataset have a large variance in values while other attributes have small variance. Although normalization was applied to the dataset, it can be difficult to decide on the correct scaling for the dataset.
Hierachical Clustering does not follow the assumptions that K-Means contains. It is better for non-spherical clusters where a cluster may be the closest data points which form a path. The clusters also don’t need to have around the same size of datapoints which is better for this dataset, because of the large number of attributes which makes it unlikely that all clusters are around the same size.
[Checkpoint 12]: Does the discussion include relevant comparison of the clusters and makes sense?